Addressing Pitfalls in Auditing Practices of Automatic Speech Recognition Technologies: A Case Study of People with Aphasia
Mei, Katelyn Xiaoying, Choi, Anna Seo Gyeong, Schellmann, Hilke, Sloane, Mona, Koenecke, Allison
Automatic Speech Recognition (ASR) has transformed daily tasks from video transcription to workplace hiring. ASR systems' growing use warrants robust and standardized auditing approaches to ensure automated transcriptions of high and equitable quality. This is especially critical for people with speech and language disorders (such as aphasia) who may disproportionately depend on ASR systems to navigate everyday life. In this work, we identify three pitfalls in existing standard ASR auditing procedures, and demonstrate how addressing them impacts audit results via a case study of six popular ASR systems' performance for aphasia speakers. First, audits often adhere to a single method of text standardization during data pre-processing, which (a) masks variability in ASR performance from applying different standardization methods, and (b) may not be consistent with how users - especially those from marginalized speech communities - would want their transcriptions to be standardized. Second, audits often display high-level demographic findings without further considering performance disparities among (a) more nuanced demographic subgroups, and (b) relevant covariates capturing acoustic information from the input audio. Third, audits often rely on a single gold-standard metric -- the Word Error Rate -- which does not fully capture the extent of errors arising from generative AI models, such as transcription hallucinations. We propose a more holistic auditing framework that accounts for these three pitfalls, and exemplify its results in our case study, finding consistently worse ASR performance for aphasia speakers relative to a control group. We call on practitioners to implement these robust ASR auditing practices that remain flexible to the rapidly changing ASR landscape.
- North America > United States > Washington > King County > Seattle (0.14)
- North America > United States > Virginia > Albemarle County > Charlottesville (0.14)
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- (8 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine > Therapeutic Area (0.92)
- Government (0.67)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.35)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)
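The first pitfall above, sensitivity to text standardization, can be illustrated with a small sketch. The example transcripts, the filler-stripping rule, and the `wer` helper below are hypothetical illustrations, not the paper's actual pipeline:

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance over reference length."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # substitution
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return d[-1][-1] / len(ref)

# Hypothetical reference with aphasia-style disfluencies, and an ASR output
# that dropped them.
ref = "the uh the boy is uh running"
hyp = "the boy is running"

strict = wer(ref.split(), hyp.split())                   # fillers kept in reference
no_fillers = wer([w for w in ref.split() if w != "uh"],  # fillers stripped first
                 hyp.split())
print(f"WER, fillers kept:     {strict:.3f}")
print(f"WER, fillers stripped: {no_fillers:.3f}")
```

The same system output scores roughly 0.43 under one standardization and 0.20 under the other, which is the masking effect the paper warns about: the pre-processing choice, not the ASR system, drives much of the reported error.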
You don't understand me!: Comparing ASR results for L1 and L2 speakers of Swedish
Cumbal, Ronald, Moell, Birger, Lopes, Jose, Engwall, Olof
The performance of Automatic Speech Recognition (ASR) systems has steadily improved as the state of the art advances. However, performance tends to decrease considerably in more challenging conditions (e.g., background noise, multiple speaker social conversations) and with more atypical speakers (e.g., children, non-native speakers or people with speech disorders), meaning that general improvements do not necessarily transfer to applications that rely on ASR, e.g., educational software for younger students or language learners. In this study, we focus on the gap in performance between recognition results for native and non-native, read and spontaneous, Swedish utterances transcribed by different ASR services. We compare the recognition results using Word Error Rate and analyze the linguistic factors that may generate the observed transcription errors.
- Europe > Sweden (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > United Kingdom (0.04)
- Research Report > Experimental Study (0.47)
- Research Report > New Finding (0.35)
Global Performance Disparities Between English-Language Accents in Automatic Speech Recognition
DiChristofano, Alex, Shuster, Henry, Chandra, Shefali, Patwari, Neal
However, many users are familiar with the frustrating experience of repeatedly not being understood by their voice assistant [16], so much so that frustration with ASR has become a culturally-shared source of comedy [4, 32]. Bias auditing of ASR services has quantified these experiences. English language ASR has higher error rates: for Black Americans compared to white Americans [24, 45], for stigmatised British accents compared to favored British accents [28], for Scottish speakers compared to speakers from California and New Zealand [44], for speakers whose first language is a tone language compared to those whose first language is not [2], for speakers with Indian accents compared to speakers with "American" accents [31], for speakers whose first language is not English compared to those for whom it is [28]. It should go without saying, but everyone has an accent - there is no "unaccented" version of English [26]. Due to colonization and globalization, different Englishes are spoken around the world. While some English accents may be favored by those with class, race, and national origin privilege [28], there is no technical barrier to building an ASR system which works well on any particular accent. So we are left with the question: why does ASR performance vary as it does as a function of the global English accent spoken?
- Oceania > New Zealand (0.24)
- North America > United States > Missouri > St. Louis County > St. Louis (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (25 more...)
- Health & Medicine (1.00)
- Education (0.93)
- Government > Regional Government (0.46)
SpeechNet: Weakly Supervised, End-to-End Speech Recognition at Industrial Scale
Tang, Raphael, Kumar, Karun, Yang, Gefei, Pandey, Akshat, Mao, Yajie, Belyaev, Vladislav, Emmadi, Madhuri, Murray, Craig, Ture, Ferhan, Lin, Jimmy
End-to-end automatic speech recognition systems represent the state of the art, but they rely on thousands of hours of manually annotated speech for training, as well as heavyweight computation for inference. Of course, this impedes commercialization since most companies lack vast human and computational resources. In this paper, we explore training and deploying an ASR system in the label-scarce, compute-limited setting. To reduce human labor, we use a third-party ASR system as a weak supervision source, supplemented with labeling functions derived from implicit user feedback. To accelerate inference, we propose to route production-time queries across a pool of CUDA graphs of varying input lengths, the distribution of which best matches the traffic's. Compared to our third-party ASR, we achieve a relative improvement in word-error rate of 8% and a speedup of 600%. Our system, called SpeechNet, currently serves 12 million queries per day on our voice-enabled smart television. To our knowledge, this is the first time a large-scale, Wav2vec-based deployment has been described in the academic literature.
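The query-routing idea in the abstract can be sketched as a simple length-bucketing rule: each incoming query is sent to the smallest pre-captured fixed-length graph that fits it. The graph lengths below are made-up placeholders; SpeechNet's actual pool is sized so its length distribution matches production traffic, which this sketch does not attempt:

```python
import bisect

# Hypothetical pool of pre-captured CUDA-graph input lengths (in audio frames),
# sorted ascending. A real deployment would choose these to match traffic.
GRAPH_LENGTHS = [400, 800, 1600, 3200]

def route(query_len, lengths=GRAPH_LENGTHS):
    """Route a query to the smallest fixed-length graph that fits it.

    The query is padded up to the chosen graph's input length; queries longer
    than the largest graph fall back to eager execution (returned as None).
    """
    i = bisect.bisect_left(lengths, query_len)
    return lengths[i] if i < len(lengths) else None

print(route(500))   # padded up to the 800-frame graph
print(route(3200))  # exact fit, no fallback
print(route(4000))  # None: longer than every graph, run eagerly
```

The design trade-off is padding waste versus pool size: more graph lengths mean less wasted compute per query but more captured graphs to hold in memory.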
Why AI startups have different economics from classic SaaS startups
Let's rewind the clock a bit. Back in the day, software vendors would write code, package it, and often distribute it physically (through those nifty things called CDs). In this old world, buyers shouldered most of the operational costs, such as running the applications they bought in their own local data centers (or on their own laptops and desktops). Then came the advent of faster Internet speeds and cloud computing, which really opened up software development and deployment to a whole new world. With that, we started to see a dramatic shift of infrastructure costs back to the software vendor. That is, in the SaaS world, vendors host and manage web apps in their own data centers or cloud environments, allowing buyers to gradually decrease the investment and expenses associated with managing infrastructure.
- Information Technology > Cloud Computing (1.00)
- Information Technology > Communications > Web (0.95)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.31)